I. Introduction: ROGELIO

II. Data Cleaning and Outlier Visualization: ROGELIO

#PART 1: Read csv, merge, clean and plot outliers.
library(readr)
library(readxl)
library(dplyr)
library(countrycode)
library(car)
source('Read_Clean.R')
cleaned <- Read_Clean()

III. Dimension Reduction Analysis: MIKA & ME

The data set contains 17 variables. To present them concisely, dimension reduction analyses were performed using two techniques: Multidimensional Scaling (MDS) and Principal Component Analysis (PCA).

Part A: Multidimensional Scaling (MDS)

To provide a general insight into the data, all countries were plotted in three-dimensional space. At first glance, clusters by continent can be seen: countries on the same continent generally have similar profiles. The most diverse continent is Asia, which has many outliers; its countries spread from the European cluster at one end to the African cluster at the other. North America and South America are similar to each other.

image

image

library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 3.5.2
source('MDS.R')
## Warning in system(my_command): 'rm' not found
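
The contents of MDS.R are not shown here, so the following is only a minimal sketch of how a three-dimensional classical MDS could be computed with base R's `cmdscale`. The built-in `USArrests` data set is an assumption that stands in for the cleaned country table.

```r
# Minimal sketch of classical MDS in three dimensions.
# Assumption: USArrests (built into R) stands in for the cleaned country data.
X <- scale(USArrests)                  # standardize so no variable dominates the distances
D <- dist(X)                           # Euclidean distances between observations
mds <- cmdscale(D, k = 3, eig = TRUE)  # classical (metric) MDS

coords <- as.data.frame(mds$points)
names(coords) <- c("Dim1", "Dim2", "Dim3")

# Share of the positive eigenvalue mass captured by the three dimensions
fit <- sum(mds$eig[1:3]) / sum(pmax(mds$eig, 0))
print(round(fit, 3))
print(head(coords))

# If scatterplot3d is installed, scatterplot3d::scatterplot3d(coords)
# would draw the three coordinates as a 3-D scatter plot.
```

Coloring the points by continent in such a plot is what makes the continent clusters described above visible.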

Part B: Principal Components

Principal Component Analysis (PCA) was used to provide insight into the data and to visualize it in two-dimensional plots. The plots below show three principal components, which together account for 74% of the variance.

PC1, PC2, and PC3 are interpreted as follows:

  • PC1 is loaded highly on number of phones, life expectancy, corruption index, access to the Internet and income.

  • PC2 is loaded highly on the number of suicides and sex ratio.

  • PC3 is especially meaningful in the context of inequality.
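
As a hedged illustration of where figures such as the cumulative proportion of variance and the loadings come from, here is a minimal sketch using base R's `prcomp`. The built-in `mtcars` data set is an assumption that stands in for the cleaned country data; the project's own PCA code is not shown.

```r
# Minimal PCA sketch with base R.
# Assumption: mtcars (built into R) stands in for the cleaned country data.
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var[1:3], 3))

# Loadings: how strongly each variable weighs on the first three components
loadings <- pca$rotation[, 1:3]
print(round(loadings, 2))

# Variable with the largest absolute loading on each component
top_var <- rownames(loadings)[apply(abs(loadings), 2, which.max)]
print(top_var)
```

Reading down a column of `loadings` is the step that supports an interpretation such as "PC1 is loaded highly on life expectancy and income".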

image

image

Two graphs are shown below to present the PCA results. Ellipses were added to the graphs to show the concentration of points; their size is influenced by outliers.

Plot PC1 vs PC2

  • On the right of the plot, with high values of PC1, highly developed countries in Europe and North America can be spotted. Those continents are above average in low corruption, life expectancy, Internet access, number of phones and income.

  • On the left side of the plot, with low values of PC1, less developed countries in Africa can be spotted. Those countries are above average in child mortality, number of children per woman and inequality.

  • Asia presents an interesting phenomenon. It is the most diverse of all continents along both PC1 and PC2. Some countries in Asia are highly developed while others are rather poor (PC1). On PC2, some countries have extreme values for sex ratio (men significantly outnumber women). Those countries are Qatar and the UAE.

  • PC1 and PC2 do not give much insight into Central America or South America, since the values for these continents lie in the middle of the plot.

PrinCompPlot[1]
## [[1]]

Plot PC2 vs PC3

The second plot shows that very high inequality is present especially in South America and Africa.

The plots above also show a high correlation among number of phones, low corruption, Internet access and income. Another group of highly correlated variables is child mortality and number of children per woman.

PrinCompPlot[2]
## [[1]]

PCA on the World Map

To show which countries rank highest on each principal component, the results were plotted on a world map. For each component (PC1, PC2, and PC3), the 15 countries with the highest scores were chosen and plotted on the map.

PrinCompPlot <- PCA(cleaned)
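
The internals of `PCA()` are not shown, so the following is only a sketch of how the top countries per component might be selected. `top_by_component` is a hypothetical helper, and the built-in `USArrests` data set is an assumption that stands in for the cleaned country data.

```r
# Hypothetical helper (not from the original scripts): for each of the first
# n_comp principal components, return the names of the k observations with
# the highest scores.
top_by_component <- function(data, n_comp = 3, k = 15) {
  pca <- prcomp(data, center = TRUE, scale. = TRUE)
  scores <- pca$x[, seq_len(n_comp), drop = FALSE]
  lapply(seq_len(n_comp), function(j) {
    ranked <- order(scores[, j], decreasing = TRUE)
    rownames(scores)[head(ranked, k)]
  })
}

# Assumption: USArrests stands in for the cleaned country data.
top <- top_by_component(USArrests, n_comp = 3, k = 15)
names(top) <- paste0("PC", 1:3)
str(top)
```

Because the sign of a principal component is arbitrary, "highest" only carries meaning once the direction of each component has been checked against its loadings.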

Note: the columns total population, number of murders, size of armed forces, total urban population and percentage of investments were excluded from the analysis. Those variables had a low correlation with the rest of the columns, so many more dimensions would be needed to explain the data, and that information could not be presented in a two-dimensional plot.

IV. CLUSTERING - ROGELIO

PART A: Hierarchical Clustering between Continents

# PART 3: Hierarchical Clustering between Continents
library(ape)
source('cluster_continents.R')
Cl_continents <- cluster_continents(cleaned)

Include all variables

South America, North America and Europe are very similar, and Central America, Asia, Oceania and Africa are similar to each other. Interestingly, Africa is clustered with Oceania, which includes Australia and New Zealand but also many small islands that pull Oceania toward the level of Africa.
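
As a hedged illustration of clustering groups rather than individual observations, the sketch below averages each variable per group and runs `hclust` on the group means. Whether `cluster_continents()` works this way is an assumption; `USArrests` with `state.region` stands in for countries and continents.

```r
# Minimal sketch: hierarchical clustering of group mean profiles.
# Assumption: US states and their regions stand in for countries and continents.
region <- state.region[match(rownames(USArrests), state.name)]
group_means <- aggregate(USArrests, by = list(region = region), FUN = mean)
rownames(group_means) <- as.character(group_means$region)

M <- scale(group_means[, -1])          # standardize each variable across groups
hc <- hclust(dist(M), method = "complete")

groups <- cutree(hc, k = 2)            # cut the dendrogram into two clusters
print(groups)

# plot(hc) would draw the dendrogram of the four regions.
```

Averaging first means that a group made up of very different members, as Oceania is described above, is represented by a profile that may not resemble any single member.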

PART B: K-means and Model-Based Clustering between Countries

Jereamy both

# PART 4: K-means & Model Based Clustering between Countries
library(mclust)
source('clusters_countries.R')
Cl_countries <- clusters_countries(cleaned)
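
The code in clusters_countries.R is not shown, so the following is only a minimal k-means sketch with base R's `kmeans`. The built-in `USArrests` data set and the choice of three clusters are assumptions made for illustration.

```r
# Minimal k-means sketch.
# Assumptions: USArrests stands in for the cleaned country data, and k = 3
# is an arbitrary choice for illustration.
set.seed(42)                           # k-means depends on random starts
X <- scale(USArrests)
km <- kmeans(X, centers = 3, nstart = 25)

print(table(km$cluster))               # cluster sizes
print(round(km$centers, 2))            # cluster profiles in standardized units

# Total within-cluster sum of squares for k = 1..6, a common aid for choosing k
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
print(round(wss, 1))
```

The model-based counterpart, `Mclust` from the mclust package, selects the number of clusters by BIC instead of requiring it up front; it is not shown here.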

V. Exploratory Factor Analysis -MIKA

To find the number of factors, EFA was performed starting with one factor and increasing the number of factors until the RMSE fell below 0.05. This criterion was first met with four factors, which was therefore taken as the optimal number. The loadings of the four-factor EFA are:

#PART 5: EFA
source('EFA.R')
EFA_loadings(cleaned)
## 
## Loadings:
##                       Factor1 Factor2 Factor3 Factor4
## pop_total                      0.995                 
## murder_pp                              0.825         
## armed_pp                                             
## phones_p100            0.615                         
## children_p_woman      -0.918                         
## life_exp_yrs           0.875                         
## suicide_pp                                           
## urban_pop_tot                  0.958                 
## sex_ratio_p100                                 0.538 
## corruption_CPI         0.538                         
## internet_%of_pop       0.847                         
## child_mort_p1000      -0.940                         
## income_per_person      0.575                   0.768 
## investments_per_ofGDP                                
## gini                                   0.625         
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      4.439   1.949   1.262   1.153
## Proportion Var   0.296   0.130   0.084   0.077
## Cumulative Var   0.296   0.426   0.510   0.587
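
The search for the number of factors described before the loadings table can be sketched as follows. This is not the code in EFA.R: the data are simulated with three underlying factors, and "RMSE" is assumed here to mean the root mean square of the off-diagonal residual correlations, which may differ from the criterion actually used.

```r
# Minimal sketch of choosing the number of factors.
# Assumptions: simulated data stands in for the country data, and RMSE is
# the root mean square of the off-diagonal residual correlations.
set.seed(1)
n <- 2000
f <- matrix(rnorm(n * 3), n, 3)                   # three latent factors
L <- kronecker(diag(3), matrix(0.8, 3, 1))        # 9 x 3 block loading matrix
X <- f %*% t(L) + matrix(rnorm(n * 9, sd = 0.6), n, 9)
colnames(X) <- paste0("v", 1:9)

residual_rmse <- function(x, k) {
  fa <- factanal(x, factors = k)
  lambda <- unclass(fa$loadings)                  # p x k loading matrix
  implied <- lambda %*% t(lambda) + diag(fa$uniquenesses)
  res <- cor(x) - implied                         # residual correlations
  sqrt(mean(res[lower.tri(res)]^2))
}

max_k <- 5                                        # largest k factanal allows for 9 variables
rmse <- rep(NA_real_, max_k)
n_factors <- NA_integer_
for (k in seq_len(max_k)) {
  rmse[k] <- residual_rmse(X, k)
  if (rmse[k] < 0.05) {                           # stop at the first k below the threshold
    n_factors <- k
    break
  }
}
print(round(rmse, 3))
print(n_factors)
```

Stopping at the first number of factors that meets the threshold keeps the model as small as the criterion allows.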

From the loadings, the factors can be interpreted as follows:
1. Factor 1 loads positively on life expectancy, internet access and income per person, and negatively on child mortality and children per woman. For these reasons, it represents the country's level of development.
2. Factor 2 represents the size of the population.
3. Factor 3 represents inequality and murder.
4. Factor 4 represents the level of income in relation to the ratio of men to women in the country.

To visualize these four factors, the top 10 countries by score on each factor were taken to create four groups of countries, one for each factor where it is most relevant. The groups are named according to the meaning of each factor:

Factor 1 -> Developed
Factor 2 -> Crowded
Factor 3 -> Inequality
Factor 4 -> Gender/Income

These can be visualized in the following graph:

source('EFA.R')
groups = EFA_plot(cleaned)

Note: some countries in the groups, such as Singapore and Qatar, are too small to be visible on the map.

The countries in group 1 are:

library(knitr)
print(groups[1])

[[1]]
 [1] "Spain"         "Estonia"       "Finland"       "Switzerland"
 [5] "Andorra"       "Austria"       "Singapore"     "Liechtenstein"
 [9] "South Korea"   "Japan"

The countries in group 2 are:

library(knitr)
print(groups[2])

[[1]]
 [1] "Japan"         "Russia"        "Bangladesh"    "Pakistan"
 [5] "Nigeria"       "Brazil"        "Indonesia"     "United States"
 [9] "India"         "China"

The countries in group 3 are:

library(knitr)
print(groups[3])

[[1]]
 [1] "Brunei Darussalam" "Swaziland"         "Honduras"
 [4] "Brazil"            "Guatemala"         "Colombia"
 [7] "Venezuela"         "Lesotho"           "South Africa"
[10] "El Salvador"

The countries in group 4 are:

library(knitr)
print(groups[4])

[[1]]
 [1] "Saudi Arabia"         "Monaco"               "Norway"
 [4] "Ireland"              "Kuwait"               "Brunei"
 [7] "Singapore"            "United Arab Emirates" "Luxembourg"
[10] "Qatar"

VII. Conclusion ->